library(tidytext)  # provides unnest_tokens() and cast_sparse(), used below
library(dplyr)
glimpse(movies_dataset)
Observations: 2,000
Variables: 2
$ class <chr> "Pos", "Pos", "Pos", "Pos", "Pos", "Pos", "Pos", "Pos", "Pos", …
$ text  <chr> "films adapted from comic books have had plenty of success wh…
In this approach, we represent each word in a document as a token (or feature) and each document as a vector of features. For simplicity, we disregard word order and keep only the number of occurrences of each word; that is, we represent each document as a multiset (‘bag’) of words.
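To make the idea concrete, here is a minimal base-R sketch on two hypothetical toy sentences (they are not taken from `movies_dataset`). The names `docs`, `tokens`, `vocab`, and `bow` are illustrative assumptions, and the tokenizer is a simple split on non-letters rather than the one used later in this document.

```r
# Two hypothetical toy documents containing the same words in a different order
docs <- c(d1 = "the movie was good good fun",
          d2 = "good fun the movie was good")

# Tokenize: lowercase, then split on runs of non-letter characters
tokens <- strsplit(tolower(docs), "[^a-z]+")

# Shared vocabulary, sorted so every document uses the same column order
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary word per document: one row per document
bow <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))

print(bow)
# Both rows are identical because word order is discarded
identical(bow["d1", ], bow["d2", ])
```

Each row of `bow` is the feature vector of one document. The two sentences differ only in word order, so their vectors coincide, which is exactly the information the bag-of-words representation throws away.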
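# Drop the class label and tag each review with its row number
dtm <- movies_dataset %>% select(-class) %>%
  mutate(row = row_number())
# Tokenize, count each word per review, and cast the counts to a sparse matrix
dtm <- dtm %>% unnest_tokens(word, text) %>% group_by(word, row) %>%
  summarise(total = n()) %>% cast_sparse(row, word, total)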
dtm
str(as.matrix(dtm))
as.matrix(dtm)[1:2,2000:2030]
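The sparse format stores only the non-zero counts, which matters because most words never occur in most reviews. The sketch below rebuilds a tiny document-term matrix by hand with `Matrix::sparseMatrix()` to show what that storage looks like. The `counts` data frame and the name `toy_dtm` are hypothetical, and this illustrates the idea rather than the internals of `cast_sparse()`.

```r
library(Matrix)

# Hypothetical long-format counts: one row per (document, word) pair
counts <- data.frame(row   = c(1, 1, 2, 2, 2),
                     word  = c("good", "movie", "bad", "movie", "plot"),
                     total = c(2, 1, 1, 3, 1))

# Map words to column indices, then fill only the non-zero cells
vocab <- sort(unique(counts$word))
toy_dtm <- sparseMatrix(i = counts$row,
                        j = match(counts$word, vocab),
                        x = counts$total,
                        dimnames = list(NULL, vocab))

dim(toy_dtm)      # documents x distinct words
nnzero(toy_dtm)   # number of stored non-zero counts
# Slice first, then densify only the small piece you want to look at
as.matrix(toy_dtm[1:2, c("good", "movie")])
```

Calling `as.matrix()` on a full document-term matrix allocates one dense cell for every (document, word) pair, so on a large vocabulary it is usually cheaper to subset the sparse matrix first and convert only that slice.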